The purpose of this tutorial is to scope out the software reporting to do EDA in R. We want to understand the capabilities and the limitations.
🔧 Preparation
The reading for this week is The Landscape of R Packages for Automated Exploratory Data Analysis. This is a lovely summary of software available that is considered to do exploratory data analysis (EDA). (Note: Dr Cook considers these to be mostly descriptive statistics packages, not exploratory data analysis in the true spirit of the term.) This reading will be the basis of the tutorial exercises today.
Complete the weekly quiz, before the deadline!
Install this list of R packages, in addition to what you installed in the previous weeks:
Open your RStudio Project for this unit, (the one you created in week 1, ETC5521). Create a .qmd document for this weeks activities.
📥 Exercises
The article lists a number of R packages that might be used for EDA: arsenal, autoEDA, DataExplorer, dataMaid, dlookr, ExPanDaR, explore, exploreR, funModeling, inspectdf, RtutoR, SmartEDA, summarytools, visdat, xray.
1.
What package had the highest number of CRAN downloads as of 12.07.2019? (Based on the paper.)
Solution
summarytools with 84737
2.
Open up the shiny server for checking download rates at https://hadley.shinyapps.io/cran-downloads/. Which of these packages has the highest download rate over the period Jan 1, 2024-today?
smartEDA was on a roll early in the year, but has virtually disappeared. summarytools is going strong. Interestingly, visdat, high on the list, was developed by Nick Tierney in the years he was at Monash.
3.
What is an interesting pattern to observe from the time series plot of all the downloads?
Solution
The weekly seasonality! There is a regular up/down pattern, that if you zoom in closely - try plotting just a couple of weeks of data - you can see corresponds to week day vs weekend.
4.
How many functions does Staniak and Biecek (2019) say visdat has for doing EDA? Explore what each of them does, by running the example code for each function. What do you think are the features that make visdat a really popular package?
Solution
6; Simple focus, useful functions that apply to a lot of problems.
It gives an overview of the variable types, and missing values. This is useful to start a data analysis so that you can start working out what methods might be applied to each, and also whether some variables may have too many missing values to be analysable.
library(visdat)# function 1vis_dat(airquality)
# function 2messy_vector <-c(TRUE,"TRUE","T","01/01/01","01/01/2001",NA,NaN,"NA","Na","na","10",10,"10.1",10.1,"abc","$%TG")set.seed(1114)messy_df <-data.frame(var1 = messy_vector,var2 =sample(messy_vector),var3 =sample(messy_vector))vis_guess(messy_df)
# function 3vis_miss(airquality)
# function 4aq_diff <- airqualityaq_diff[1:10, 1:2] <-NAvis_compare(airquality, aq_diff)
# function 5dat_test <- tibble::tribble(~x, ~y,-1, "A",0, "B",1, "C",NA, NA )vis_expect(dat_test, ~.x ==-1)
# function 6vis_cor(airquality)
5.
The package summarytools appears to becoming more favourable relative to visdat. Take a look at this package and explain what tools it has that are not available in visdat.
It has a lot more standard statistical summaries, and outputs the summary in a form that can be embedded nicely in a quarto report.
If you go to the package website https://github.com/dcomtois/summarytools it’s not very promising. There is no package website only the GitHub. There have been no updates for 8 months.
6.
Why do you think the package SmartEDA has gone out of favour?
Solution
The web site looks a little uncared for, missing images on the main GitHub page. It has a nice package pacge, but again pieces are missing.
library(SmartEDA)ExpData(data=mtcars,type=1)
Descriptions Value
1 Sample size (nrow) 32
2 No. of variables (ncol) 11
3 No. of numeric/interger variables 11
4 No. of factor variables 0
5 No. of text variables 0
6 No. of logical variables 0
7 No. of identifier variables 0
8 No. of date variables 0
9 No. of zero variance variables (uniform) 0
10 %. of variables having complete cases 100% (11)
11 %. of variables having >0% and <50% missing cases 0% (0)
12 %. of variables having >=50% and <90% missing cases 0% (0)
13 %. of variables having >=90% missing cases 0% (0)
It also has some standard summaries like summarytools which is possibly why it became attractive. And can be used to produce a detailed report on the data, although a report like the one in the next question took ages to complete.
I’m not sure why it’s popularity has dropped. The documentation is not great, there’s quite a few spelling errors. The functions are (sort of) in CamelCase which is not as attractive as snake_case.
7.
The SmartEDA::ExpReport() and DataExplorer::create_report() are functions that generates a long report when provided with a data set. Try this using the latter function, on the nycflights13 data, using this code:
# DataExplorerlibrary(DataExplorer)library(nycflights13)library(tidyverse)# Create a big data setairlines_all <- flights |>full_join(airlines, by ="carrier") |>full_join(planes, by ="tailnum", suffix =c("_flights", "_planes")) |>full_join(airports, by =c("origin"="faa"), suffix =c("_carrier", "_origin")) |>full_join(airports, by =c("dest"="faa"), suffix =c("_origin", "_dest"))
create_report(airlines_all, y ="arr_delay")
The code instructs to use arr_delay as a “response variable”. Give some reasons why this report is not very useful!
Solution
There is a lot of information, and many of the plots and tables are particularly unreadable.
Have your Generative AI assistant suggest what to look at when it is given the variable summary, from:
glimpse(airlines_all)
Solution
Here’s what I got from Claude:
Based on the dataset you've provided, which appears to be about airline flights, I can suggest several plots and summaries that would be informative. Here are some recommendations:
Time-based analysis:
Line plot of average delays (dep_delay and arr_delay) by month and year
Heatmap of delays by day of week and hour of day
Box plots of air_time by month
Carrier analysis:
Bar plot of flight counts by carrier
Box plots of delays by carrier
Scatter plot of departure delay vs. arrival delay, colored by carrier
Route analysis:
Bar plot of most common routes (origin-dest pairs)
Map visualization of routes with line thickness indicating frequency
Distance analysis:
Scatter plot of distance vs. air_time
Histogram of flight distances
Aircraft analysis:
Bar plot of most common aircraft models
Box plot of delays by aircraft manufacturer
Airport analysis:
Bar plot of busiest airports (by departure and arrival)
Map of airports with point size indicating traffic volume
Summaries:
Table of average delays by carrier
Summary statistics (mean, median, std dev) for air_time and distance
Correlation matrix of numeric variables
Time series:
Time series decomposition of daily flight counts
Categorical data:
Stacked bar charts for engine types by manufacturer
Performance metrics:
Scatter plot of scheduled arrival time vs. actual arrival time
It’s nicely organise by different types of analysis activities that might be done. Each of these require different ways to aggregate the data, to make summaries or plots by various categorical variables. But it doesn’t suggest pre-processing that might be necessary, so I asked it to tell me what might be needed:
Before creating visualizations and summaries, some pre-processing steps would be beneficial for this dataset. Here are some recommendations:
Handle missing values:
Check for NA or null values, especially in columns like speed, air_time, and delay fields.
Decide whether to impute missing values or remove rows with missing data, depending on the extent and nature of missingness.
Date and time formatting:
Combine year_flights, month, and day into a single date column.
Ensure time_hour is properly formatted as a datetime object.
Categorical data:
Convert appropriate columns to factors (e.g., carrier, origin, dest, manufacturer).
Check for and handle any inconsistent spellings or codings in categorical variables.
Delay calculations:
Create a total_delay column by summing dep_delay and arr_delay.
Consider creating a binary column for whether a flight was delayed (e.g., over 15 minutes).
Outlier detection:
Check for and possibly handle extreme values in delay times, air_time, and distance.
Feature engineering:
Create a day_of_week column from the date.
Calculate flight speed from distance and air_time.
Create a season column based on month.
Data type conversions:
Ensure numeric columns are of the appropriate type (int or float).
Convert time columns (dep_time, arr_time, etc.) to a more usable format if they're not already.
Geocoding:
Ensure lat and lon columns for both origin and destination are numeric and contain valid coordinates.
Consistency checks:
Verify that arr_time is always later than dep_time.
Check that air_time is consistent with the difference between arr_time and dep_time.
Merging data:
If the planes and airports data are in separate datasets, merge them carefully with the main flights data.
Creating subsets:
Consider creating smaller datasets for specific analyses (e.g., by year, by carrier) to improve processing speed.
This is quite useful!
8.
In a limited fashion, lets work through some areas suggested by Claude.
Use visdat to examine the variable types and missing values. You’ll need to take a sample of the data because there are too many observations to reasonably plot. However, a sample should give reasonable insight on the reliability of most variables. What variables may not be useful because they have too many missing values?
speed is mostly missing. It definitely cannot be used.
type, manufacturer, model, engine, year_planes, engines, seats have a lot of missings, and so maybe unusable. It is possible that these missings are for some types of planes, and other types of planes have more complete records.
For the rest of these questions, you decide how to process the data, make summaries or plots to provide answers.
Which carrier had the most flights?
Is this the same for each month? Or day of the week?
Are there more departure delays for flights in the morning hours, or evening hours?
Find an error in the data, e.g. a flight that arrived before it left.
With your neighbour in the tutorial come up with one thing that is a bit surprising to you that you can learn from this data. Make sure you state what you expected to see, and why what you saw was then a surprise. (It is possible that you can use the DataExplorer report to look at something you had not thought to examine, as motivation.)
Solution
airlines_all |>count(carrier, sort=TRUE)
# A tibble: 17 × 2
carrier n
<chr> <int>
1 UA 58665
2 B6 54635
3 EV 54173
4 DL 48110
5 AA 32729
6 MQ 26397
7 US 20536
8 9E 18460
9 WN 12275
10 VX 5162
11 FL 3260
12 <NA> 2812
13 AS 714
14 F9 685
15 YV 601
16 HA 342
17 OO 32
There are some flights that leave substantially early, maybe not a mistake but worrying for a traveller.
What did you find?
9.
Table 2 of the Landscape paper summarises the activities of two early phases of the CRISP-DM standard. What does CRISP-DM mean? The implication is that EDA is related to “data understanding” and “data preparation”. Would you agree with this or disagree? Why?
Solution
Cross-Industry Standard Process for Data Mining; EDA techniques can be useful for some parts of these stages, for example finding outliers, or examining missing value patterns. Some of these steps are important for effective EDA, too, for example, you need to know what types of variables you have in order to decide what types of plots to make.
10.
Table 1 of the paper is summarising CRAN downloads and GitHub activity is hard to read. How are the rows sorted? What is the most important information communicated by the table? In what way(s) might revising this table make it easier to read and digest the most important information?
Solution
Most important information is the download rate because the purpose is to know which are the commonly used packages. Sorting rows by downloads makes the table easier to read.
👌 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.